
Demo of PDL's Companies Dataset (Mini)

This is a quick notebook demonstrating some of the things you can do with PDL's companies dataset. We'll use a subset of the dataset containing only companies based in the US, to limit resource usage and keep this notebook reasonably interactive. Below is an overview of what we'll cover:

Overview

  1. Brief Exploratory Analysis
  2. Show a map of the 10 largest companies in the US
  3. Show the relationship between company age and company size
  4. Compare the most common industries across cities

Getting set up

First, let's set up our environment and load the necessary modules and data.
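As a rough illustration of the loading step, here is a minimal sketch that uses only the Python standard library. The column names (`name`, `country`, `locality`) and the inline sample rows are assumptions made for this example, not the dataset's confirmed schema. In practice you would pass an open file handle for the downloaded CSV instead of the `io.StringIO` stand-in.

```python
import csv
import io

# Hypothetical sample standing in for the companies CSV file.
# Column names here are assumptions, not the confirmed schema.
SAMPLE_CSV = """name,country,locality
acme corp,united states,"pittsburgh, pennsylvania, united states"
globex,canada,"toronto, ontario, canada"
initech,united states,"houston, texas, united states"
"""

def load_us_companies(handle):
    """Read company rows and keep only those based in the US."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["country"] == "united states"]

us_companies = load_us_companies(io.StringIO(SAMPLE_CSV))
print(len(us_companies), "US companies loaded")
```

Filtering while reading keeps memory usage down, which matters when the full file is much larger than the US subset.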

0. Exploratory Analysis

Next, let's take a quick look at the dataset to better understand what it contains.

Number of Companies in Each Industry

Seems like the most common industries represented in this dataset are:

  1. Construction
  2. Information Technology and Services
  3. Marketing and Advertising
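A ranking like the one above can be computed by counting how often each industry label occurs. The sketch below uses made-up sample labels, so the counts are illustrative only and do not reflect the real dataset.

```python
from collections import Counter

# Hypothetical industry labels, one per company (assumed sample data).
industries = [
    "construction", "construction", "construction",
    "information technology and services",
    "information technology and services",
    "marketing and advertising",
]

industry_counts = Counter(industries)

# most_common(n) returns the n most frequent labels with their counts.
for industry, count in industry_counts.most_common(3):
    print(f"{industry}: {count}")
```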

Number of Companies Founded Each Year

It's interesting to note that the majority of companies in the dataset were founded between 2010 and 2015; after that, there is noticeably less data for more recent companies.
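One way to check a claim like this is to count companies per founding year and measure the share that falls inside the window. The years below are invented sample values, and treating missing years as `None` is an assumption about how gaps are represented.

```python
from collections import Counter

# Hypothetical founding years; None marks a missing value (assumed data).
years_founded = [2008, 2011, 2012, 2012, 2014, None, 2015, 2019]

# Drop missing years before counting so they do not form their own bucket.
per_year = Counter(y for y in years_founded if y is not None)

# Share of companies with a known year that fall in the 2010-2015 window.
known = sum(per_year.values())
in_window = sum(n for year, n in per_year.items() if 2010 <= year <= 2015)
share_2010_2015 = in_window / known
print(f"{share_2010_2015:.0%} founded between 2010 and 2015")
```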

Number of Companies in Each US City

This figure shows a power-law-like distribution of companies across US cities: most of the companies are concentrated in relatively few cities.
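The concentration described here can be quantified by ranking cities by company count and asking what fraction of all companies the top few cities hold. This is a sketch on invented data; it measures concentration and does not by itself establish that the distribution follows a power law.

```python
from collections import Counter

# Hypothetical city label per company (assumed sample data).
cities = (["new york"] * 8 + ["chicago"] * 4 + ["houston"] * 2
          + ["boise"] + ["tulsa"])

city_counts = Counter(cities)
ranked = city_counts.most_common()  # sorted from largest to smallest

def top_share(ranked_counts, k):
    """Fraction of all companies located in the k largest cities."""
    total = sum(n for _, n in ranked_counts)
    return sum(n for _, n in ranked_counts[:k]) / total

print(f"Top 2 of {len(ranked)} cities hold {top_share(ranked, 2):.0%} of companies")
```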

While we're at it, let's also see what this looks like geographically.

1. Map the 10 Largest Companies in the US

Now, as our first real task, let's find the 10 largest companies in the US and plot them on a map.
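The selection half of this task can be sketched as follows, assuming each company record carries an employee estimate. The `employees` field name and the sample values are hypothetical. Plotting the result additionally requires coordinates for each company, which this sketch does not cover.

```python
import heapq

# Hypothetical (name, employee estimate) records; values are made up.
companies = [
    {"name": "alpha", "employees": 120},
    {"name": "beta", "employees": 98000},
    {"name": "gamma", "employees": 4500},
    {"name": "delta", "employees": 310000},
]

def largest_companies(rows, n=10):
    """Return the n rows with the highest employee estimate."""
    return heapq.nlargest(n, rows, key=lambda row: row["employees"])

top = largest_companies(companies, n=2)
print([row["name"] for row in top])
```

`heapq.nlargest` avoids sorting the whole list, which helps when only a handful of rows are needed from a large dataset.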

2. Relationship between Company Age and Company Size

Next, let's see if we can find any relationship between the age of a company and its size.

Perhaps unsurprisingly, there doesn't seem to be any correlation between a company's size and the year in which it was founded. In fact, what seems to stand out the most are the outliers, where unexpectedly large companies seem to pop up with equal probability across the years.
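To put a number on "no correlation", one option is the Pearson correlation coefficient between company age and employee count. The sketch below implements it by hand on made-up values; the reference year used to compute age is an assumption. Note that Pearson's r is sensitive to exactly the kind of outliers mentioned above, so a rank-based measure may be worth comparing against.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: founding year and employee estimate (made-up values).
year_founded = [1985, 1999, 2005, 2012, 2016]
employees = [40, 12000, 25, 300, 60]

# Age is measured relative to a fixed reference year (an assumption).
REFERENCE_YEAR = 2020
ages = [REFERENCE_YEAR - y for y in year_founded]

r = pearson(ages, employees)
print(f"Pearson r between age and size: {r:.2f}")
```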

3. Comparing Industries Between Cities

Now, let's look at how industries vary across different cities. We know from our previous cell that Chicago, San Francisco, and Houston have a similar number of companies, so let's see how those companies are distributed across industries.

From this figure, it seems like Houston stands out for its oil & energy and construction industries, while Chicago seems to have more companies in its marketing & advertising and financial services industries. Unsurprisingly, San Francisco dominates in its internet, computer software, and information technology industries.
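A comparison like this boils down to counting industries separately per city and converting the counts to shares. The sketch below does that on invented (city, industry) pairs, so its output is illustrative and does not reproduce the figure.

```python
from collections import Counter, defaultdict

# Hypothetical (city, industry) pairs, one per company (made-up data).
records = [
    ("houston", "oil & energy"), ("houston", "oil & energy"),
    ("houston", "construction"),
    ("chicago", "marketing and advertising"), ("chicago", "financial services"),
    ("san francisco", "internet"), ("san francisco", "computer software"),
    ("san francisco", "internet"),
]

# Count industries separately for each city.
by_city = defaultdict(Counter)
for city, industry in records:
    by_city[city][industry] += 1

def industry_shares(counts):
    """Convert raw counts to fractions of the city's total."""
    total = sum(counts.values())
    return {industry: n / total for industry, n in counts.items()}

for city, counts in by_city.items():
    top_industry, _ = counts.most_common(1)[0]
    print(city, "->", top_industry, industry_shares(counts))
```

Using shares rather than raw counts keeps the comparison fair even when the cities do not have exactly the same number of companies.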

Wrapping Up

As you can see, there are a lot of interesting findings and useful insights that can be found by looking into this dataset. Hopefully, this gives you some ideas!